eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713

albrow · 2018-05-08T22:10:23Z

This small change fixes a bug in tx_pool.go which causes geth to crash in certain cases when debug.setHead is used.

I discovered this bug while running a single-node, private PoA network for testing purposes. Here are the steps I took to consistently produce the crash:

Start a single-node, private PoA network (I used puppeth and followed this guide as a starting point). Be sure to enable the RPC API with the flag: --rpcapi 'miner,debug'.
Mine at least one block. If you need to start the miner, you can use the "miner_start" endpoint of the JSON RPC API.
Stop the miner via the "miner_stop" endpoint.
Set the head to a previous block number via the "debug_setHead" endpoint.
Start the miner again.
Geth crashes with a stack trace that looks like the following:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x447559f]

goroutine 75 [running]:
github.com/ethereum/go-ethereum/core/types.(*Block).NumberU64(...)
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/types/block.go:310
github.com/ethereum/go-ethereum/core.(*TxPool).reset(0xc420241dc0, 0xc420578b40, 0xc420578d80)
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/tx_pool.go:370 +0x6bf
github.com/ethereum/go-ethereum/core.(*TxPool).loop(0xc420241dc0)
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/tx_pool.go:289 +0x504
created by github.com/ethereum/go-ethereum/core.NewTxPool
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/tx_pool.go:253 +0x4ed

I believe this is happening because when the ChainHeadEvent is received in tx_pool.go, the variable head refers to a block that no longer exists. We can detect this cause by checking if the block number of head is greater than the current block number. And if that is the case, we can fix it by simply setting head to the current block.

After making this change, I tested it manually using the steps described above and observed that geth no longer crashes and continues to mine starting from the new block number. I also ran make test to ensure that all the existing tests still pass (they do). However, there still could be some unintended side effects I'm not aware of.

If we would like to add a test case for this, I would be happy to take a stab at it and include it in this PR. I just need some advice on where the test should live and how to get started on writing it.

karalabe · 2018-05-10T10:19:46Z

core/tx_pool.go

@@ -286,6 +286,12 @@ func (pool *TxPool) loop() {
 				if pool.chainconfig.IsHomestead(ev.Block.Number()) {
 					pool.homestead = true
 				}
+				if pool.chain.CurrentBlock().NumberU64() < head.NumberU64() {


Wouldn't it make more sense to use ev.Block instead of pool.chain.CurrentBlock()? It's faster and should be the same thing.

Sure thing, I just didn't realize they were the same. Updated.

GitCop · 2018-05-10T18:06:13Z

Thank you for your contribution! Your commits seem to not adhere to the repository coding standards

Commit: 63e6250dde40a46da3506c9150228339400b22dc
Commits must be prefixed with the package(s) they modify

Please check the contribution guidelines for more details.

This message was auto-generated by https://gitcop.com

albrow · 2018-05-14T20:59:59Z

Any additional feedback here?

It looks like the CI tests are failing due a timeout unrelated to my changes.

albrow · 2018-05-15T22:54:00Z

After running more of our smart contract tests against my fork, I realized Geth was still crashing in some cases after calling debug.setHead. Even stranger, I observed some cases where both ev.Block and pool.chain.CurrentBlock() had block numbers that were too high and did not actually exist in the chain, which resulted in nil pointer dereferences. In addition, after calling debug.setHead the miner often either logged an error and stopped or simply hung waiting for new transactions (depending on how the network is configured and whether it's using Ethash or Clique for consensus).

Rather than trying to track down every place in the code where we were holding onto some pointers to blocks, I decided to opt for a different approach. This new PR modifies EthAPIBackend.SetHead to stop and re-initialize the miner, txpool, and consensus engine. This guarantees that we don't run into any problems with those parts of the codebase holding onto old state, but at the cost of making SetHead slower and computationally more expensive.

I've updated my commit message and PR title to reflect the fact that the changes now lie in the "eth" package instead of the "core" package.

I'm not sure if this covers everything that could be affected by SetHead, or if there is a more efficient way to do this. I do know that with these changes, Geth is no longer crashing or stalling out for our use case. I look forward to hearing your feedback, and I'm happy to make further changes as needed.

holiman · 2018-05-25T09:14:31Z

On the one hand, this looks clean and the setHead operation is in the area "Use at your own risk, may break things", so I'm inclined to accept this PR if it works for you.

However, I'm a bit curious what happens with all existing channels and go-routines coupled to the old txpool and miner that you simply discard. I'm thinking that maybe that should be cleaned up, otherwise things could potentially get very weird.

Could you try to e..g call setHead 100 times in a row, check what happens?

albrow · 2018-05-29T20:12:54Z

@holiman Our tests call setHead about 200 times and I am able to run them multiple times in a row without any apparent issues. I also have been monitoring memory usage and it does not appear to increase (though it's possible the increase is too small to detect with htop due to rounding).

I assumed calling b.eth.miner.Stop() and b.eth.txPool.Stop() would clean up any resources like goroutines and channels. Is that not the case?

albrow · 2018-05-29T20:18:49Z

eth/api_backend.go

 	b.eth.blockchain.SetHead(number)
+	b.eth.txPool = core.NewTxPool(b.eth.config.TxPool, b.eth.chainConfig, b.eth.blockchain)
+	if b.eth.chainConfig.Clique != nil {


At my company, we use a single node network for testing with Clique as the consensus engine. I'm not sure how to handle Ethash or other consensus engines here. Is that something we need to consider?

albrow · 2019-01-22T23:53:44Z

@holiman or @karalabe I believe I've answered your questions and responded to your comments. Do you have any additional feedback here?

For the past 6-7 months at 0x, we've been running tests against our fork of Geth which includes these changes. We haven't experienced any major issues. Our test suite is extensive and hits the debug_setHead endpoint ~1,000 times for each test run.

We have reached the point where we have some other developers in the 0x ecosystem using our fork of Geth and relying on the debug_setHead feature (as well as the debug_increaseTime feature introduced in #16834). If at all possible, we would like to merge these changes upstream to make it easier for everyone.

Edit: To clarify, the original issue that I wanted to fix with this PR appears to still exist (at least in some form; I'm not sure if there are still cases where Geth crashes or if it only hangs now). If we try running our test suite against the latest release (as of this comment, v1.8.21), Geth hangs/does not mine any new blocks and the tests are unable to complete.

holiman · 2019-01-24T10:04:20Z

Commen from review discussion:

setHead should ideally be identical to a deep reorg, so it would be better if the sethead functionality used the same notification mechanisms / channels as are triggered during a reorg. If some component, like txpool, doesn't handle such a reorg properly, then that needs to be fixed. Makes sense?

albrow · 2019-01-29T19:44:11Z

@holiman Thank you for the response. I agree with what you're saying in theory. Before I made any changes, setHead was pretty much implemented as a chain re-org in the way you described. However, as I have shown, this leads to crashes/hangs in other parts of the code that don't expect sudden changes to the chain.

I don't necessarily have the knowledge of Geth internals required to debug and fix issues with the chain re-org approach. This is especially true when you have complicated interactions between different parts of the codebase (e.g., as I described above ev.Block and pool.chain.CurrentBlock() having block numbers that are too high, which results in nil pointer exceptions when you try to lookup the corresponding block in the chain).

Consider this: This PR addresses a specific, reproducible bug with setHead. The current version of setHead as implemented in the master branch does not work. The changes in this PR are isolated to a small part of the code (one single file) and will be easy to remove in the future in case someone comes up with a better fix. The fix in this PR does not result in any resource leaks as far as I can tell, and the only real downside is a reduction in performance (I can't speak for everyone, but for our test suite the performance impact is negligible). Would you consider merging this as a stopgap measure?

Otherwise, if you really don't want to merge this PR as is, I will try my best to address your concerns and implement setHead using the existing chain re-org mechanisms instead of restarting everything. This would take some additional time and means I would have to address the issues in tx_pool.go and any other place where there are potentially outdated pointers to blocks. Could you point me to some examples where one part of the codebase gets notified of a chain re-org and handles it appropriately? Any additional info you could share to help me better understand what is going wrong?

albrow · 2019-02-26T01:24:44Z

To clarify on the current status of this PR: It is currently not working due to merge conflicts/upstream changes. It will take some significant work for me to rebase and fix any issues, so I'm waiting for answers to my questions and some consensus on the approach before continuing. I'm happy to continue working on this PR. Just need some feedback first.

karalabe · 2019-03-07T10:01:23Z

Potentially fixed by #19216 in a cleaner way. That one reproduces the issue more deterministically, we might be able to fix this in the txpool fully isolated.

adamschmideg · 2019-03-21T10:13:26Z

We're trying to reproduce it and create an issue based on that.

holiman · 2019-03-26T10:19:05Z

Superseded by #19308

albrow · 2019-03-28T00:35:15Z

@holiman @karalabe While I am happy that you were able to find a cleaner solution, it unfortunately still doesn't fix all the problems with debug_setHead. See #19346.

albrow requested review from holiman and karalabe as code owners May 8, 2018 22:10

albrow force-pushed the sethead-txpool-fix branch from a42b17b to 6dc0202 Compare May 9, 2018 18:22

karalabe reviewed May 10, 2018

View reviewed changes

albrow force-pushed the sethead-txpool-fix branch from 63e6250 to d10ec7e Compare May 10, 2018 18:10

eth: reset miner, engine, and txpool when calling SetHead

217edb2

albrow force-pushed the sethead-txpool-fix branch from d10ec7e to 217edb2 Compare May 15, 2018 22:35

albrow changed the title ~~core: fix a crash in tx_pool.go when debug.setHead is used~~ eth: reset miner, engine, and txpool when calling SetHead to prevent crashes May 15, 2018

albrow force-pushed the sethead-txpool-fix branch from f8db324 to 217edb2 Compare May 29, 2018 20:05

albrow commented May 29, 2018

View reviewed changes

adamschmideg added the status:triage label Jan 23, 2019

adamschmideg added need:more-information and removed status:triage labels Jan 24, 2019

ligi added status:triage and removed need:more-information labels Feb 19, 2019

adamschmideg assigned fjl Feb 28, 2019

adamschmideg added need:more-information and removed status:triage labels Mar 21, 2019

holiman self-assigned this Mar 21, 2019

adamschmideg unassigned fjl Mar 21, 2019

adamschmideg added need:steps-to-reproduce and removed need:more-information labels Mar 21, 2019

holiman mentioned this pull request Mar 21, 2019

core: make txpool handle reorg due to setHead #19308

Merged

holiman closed this Mar 26, 2019

albrow mentioned this pull request Mar 28, 2019

debug_setHead causes Clique to hang when period is 0 #19346

Closed

albrow mentioned this pull request May 23, 2019

devnet: Is packages still maintained? 0xProject/0x-monorepo#1831

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713

eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713

albrow commented May 8, 2018

karalabe May 10, 2018

albrow May 10, 2018

GitCop commented May 10, 2018

albrow commented May 14, 2018

albrow commented May 15, 2018

holiman commented May 25, 2018

albrow commented May 29, 2018 •

edited

Loading

albrow May 29, 2018 •

edited

Loading

albrow commented Jan 22, 2019 •

edited

Loading

holiman commented Jan 24, 2019

albrow commented Jan 29, 2019

albrow commented Feb 26, 2019

karalabe commented Mar 7, 2019

adamschmideg commented Mar 21, 2019

holiman commented Mar 26, 2019

albrow commented Mar 28, 2019 •

edited

Loading

eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713

eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713

Conversation

albrow commented May 8, 2018

karalabe May 10, 2018

Choose a reason for hiding this comment

albrow May 10, 2018

Choose a reason for hiding this comment

GitCop commented May 10, 2018

albrow commented May 14, 2018

albrow commented May 15, 2018

holiman commented May 25, 2018

albrow commented May 29, 2018 • edited Loading

albrow May 29, 2018 • edited Loading

Choose a reason for hiding this comment

albrow commented Jan 22, 2019 • edited Loading

holiman commented Jan 24, 2019

albrow commented Jan 29, 2019

albrow commented Feb 26, 2019

karalabe commented Mar 7, 2019

adamschmideg commented Mar 21, 2019

holiman commented Mar 26, 2019

albrow commented Mar 28, 2019 • edited Loading

albrow commented May 29, 2018 •

edited

Loading

albrow May 29, 2018 •

edited

Loading

albrow commented Jan 22, 2019 •

edited

Loading

albrow commented Mar 28, 2019 •

edited

Loading